7 research outputs found

    E-PUR: An Energy-Efficient Processing Unit for Recurrent Neural Networks

    Full text link
    Recurrent Neural Networks (RNNs) are a key technology for emerging applications such as automatic speech recognition, machine translation, and image description. Long Short-Term Memory (LSTM) networks are the most successful RNN implementation, as they can learn long-term dependencies to achieve high accuracy. Unfortunately, the recurrent nature of LSTM networks significantly constrains the amount of parallelism, so multicore CPUs and many-core GPUs exhibit poor efficiency for RNN inference. In this paper, we present E-PUR, an energy-efficient processing unit tailored to the requirements of LSTM computation. The main goal of E-PUR is to support large recurrent neural networks on low-power mobile devices. E-PUR provides an efficient hardware implementation of LSTM networks that is flexible enough to support diverse applications. One of its main novelties is a technique we call Maximizing Weight Locality (MWL), which improves the temporal locality of the memory accesses that fetch the synaptic weights, greatly reducing memory requirements. Our experimental results show that E-PUR achieves real-time performance for different LSTM networks while reducing energy consumption by orders of magnitude with respect to general-purpose processors and GPUs, and it requires a very small chip area. Compared to a modern mobile SoC, the NVIDIA Tegra X1, E-PUR provides an average energy reduction of 92x.
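
    As a rough illustration of the loop-reordering idea behind weight-locality techniques like MWL, the sketch below contrasts a time-major evaluation order, which re-reads the full weight matrix at every time step, with a neuron-major (weight-stationary) order for the input-dependent part of an LSTM gate, which carries no recurrent dependence. The shapes and array names are illustrative assumptions, not E-PUR's actual design.

        # Minimal sketch of loop reordering for weight locality. Names and
        # shapes are illustrative, not E-PUR's design. The input-dependent
        # part of the LSTM gates (W_x @ x_t) has no recurrent dependence,
        # so it can be evaluated weight-stationary: keep one neuron's weight
        # row resident and stream the whole input sequence past it, instead
        # of refetching all weights at every time step.
        import numpy as np

        T, I, N = 100, 64, 128          # time steps, input size, neurons per gate
        W_x = np.random.randn(N, I)     # input weights for one gate
        x = np.random.randn(T, I)       # input sequence

        # Time-major order: every step touches the full weight matrix,
        # so weights are re-read T times from memory.
        out_tm = np.empty((T, N))
        for t in range(T):
            for n in range(N):
                out_tm[t, n] = W_x[n] @ x[t]

        # Neuron-major (weight-stationary) order: each weight row is read
        # once and reused across all T time steps.
        out_nm = np.empty((T, N))
        for n in range(N):
            w = W_x[n]                  # stays in on-chip storage
            for t in range(T):
                out_nm[t, n] = w @ x[t]

        assert np.allclose(out_tm, out_nm)  # same result, different locality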

    Neuron-level fuzzy memoization in RNNs

    Get PDF
    The final publication is available at ACM via http://dx.doi.org/10.1145/3352460.3358309. Recurrent Neural Networks (RNNs) are a key technology for applications such as automatic speech recognition or machine translation. Unlike conventional feed-forward DNNs, RNNs remember past information to improve the accuracy of future predictions, which makes them very effective for sequence processing problems. In each application run, every recurrent layer is executed many times to process a potentially large sequence of inputs (words, images, audio frames, etc.). In this paper, we make the observation that the output of a neuron exhibits small changes across consecutive invocations. We exploit this property to build a neuron-level fuzzy memoization scheme, which dynamically caches the output of each neuron and reuses it whenever the current output is predicted to be similar to a previously computed result, thereby avoiding the output computation. The main challenge in this scheme is determining whether the neuron's output for the current input in the sequence will be similar to a recently computed result. To this end, we extend the recurrent layer with a much simpler Bitwise Neural Network (BNN) and show that the BNN and RNN outputs are highly correlated: if two BNN outputs are very similar, the corresponding outputs in the original RNN layer are likely to exhibit negligible changes. The BNN thus provides a low-cost and effective mechanism for deciding when fuzzy memoization can be applied with a small impact on accuracy. We evaluate our memoization scheme on top of a state-of-the-art accelerator for RNNs, for a variety of neural networks from multiple application domains, and show that it avoids more than 24.2% of computations, resulting in 18.5% energy savings and a 1.35x speedup on average.
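
    The following is a minimal software sketch of the neuron-level memoization idea: a cheap binarized predictor decides whether a cached output can be reused. The 1-bit weights, the distance threshold, and the cache policy are illustrative assumptions, not the paper's exact mechanism.

        # Minimal sketch of neuron-level fuzzy memoization with a binary
        # predictor. The binarization, threshold, and cache policy are
        # illustrative assumptions.
        import numpy as np

        class FuzzyMemoNeuron:
            def __init__(self, weights, threshold=2):
                self.w = weights                  # full-precision weights
                self.w_bin = np.sign(weights)     # 1-bit (binarized) weights
                self.threshold = threshold        # max BNN distance for reuse
                self.cached_bnn = None            # BNN output of cached result
                self.cached_out = None            # memoized neuron output

            def forward(self, x):
                # Cheap predictor: binarized dot product (the "BNN" stage).
                bnn_out = self.w_bin @ np.sign(x)
                if (self.cached_bnn is not None
                        and abs(bnn_out - self.cached_bnn) <= self.threshold):
                    return self.cached_out        # reuse: skip the real dot product
                # Miss: do the expensive computation and refresh the cache.
                out = np.tanh(self.w @ x)
                self.cached_bnn, self.cached_out = bnn_out, out
                return out

        neuron = FuzzyMemoNeuron(np.random.randn(64))
        for x in np.random.randn(100, 64) * 0.1:  # slowly varying inputs
            y = neuron.forward(x)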

    Removing checks in dynamically typed languages through efficient profiling

    Get PDF
    Dynamically typed languages increase programmers' productivity at the expense of runtime overheads to manage the types of variables, since types are not declared at compile time and can change at runtime. One of the most important overheads comes from the very frequent checks introduced in the specialized code to verify the types of variables. In this paper, we present a hybrid HW/SW mechanism that removes checks from the optimized code by performing hardware profiling of the types of object variables. To demonstrate the benefits of the proposed technique, we implement it in a JavaScript engine and show that it produces a 7.1% speedup on average for optimized JavaScript code (up to 34% for some applications) and a 6.5% energy reduction.
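
    As a software analogy of the checks being removed, the sketch below shows the kind of type guard that specialized code executes on every call, and how a profile showing a monomorphic call site justifies emitting guard-free code. All names are hypothetical, and the paper's profiling is done in hardware rather than in a dictionary.

        # Illustrative sketch of the type checks ("guards") in specialized
        # JIT code, and of how profiling can justify removing them.

        def add_specialized_guarded(a, b):
            # Specialized for floats, but must re-validate the assumption
            # on every call because types may change at runtime.
            if type(a) is not float or type(b) is not float:
                raise TypeError("deoptimize: fall back to generic code")
            return a + b                      # fast path

        def add_specialized_unguarded(a, b):
            # Emitted once profiling shows this call site has only ever
            # seen floats: the guard above is removed entirely.
            return a + b

        profile = {"add_site_42": {float}}    # types observed at the call site
        if profile["add_site_42"] == {float}:
            add = add_specialized_unguarded   # checks proven unnecessary
        else:
            add = add_specialized_guarded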

    Co-designed solutions for overhead removal in dynamically typed languages

    Get PDF
    Dynamically typed languages are ubiquitous in today's applications. These languages ease the task of programmers but introduce significant runtime overheads, since variables are neither declared nor bound to a particular type and may hold values of different types during the same execution; consequently, their types must be constantly evaluated at runtime. For efficiency, the code generated at runtime is specialized for certain data types, yet this specialized code must still validate those types continually, so important overheads remain and take different forms depending on the kind of application. This thesis proposes three hybrid HW/SW mechanisms that reduce these different forms of overhead. The first two target regular, long-running applications with high code reuse, which spend most of their time in specialized code whose main cost is the frequent checking operations that verify assumptions about object types. The first technique improves performance by reducing the number of instructions, and their latency, used to perform these checks. The second is based on a novel dynamic type-profiling scheme that removes most of these checks altogether. The third technique targets short-running applications with low code reuse, typical of fast event handling in web environments, which spend more of their time in non-optimized code that performs a significant amount of profiling for future optimizations. For these, we present a hybrid HW/SW mechanism that computes the addresses of object properties very efficiently, eliminating most of the type-disambiguation work for read accesses to an object's methods or attributes. This is an innovative approach that significantly improves on the speculative strategy currently adopted by state-of-the-art dynamic compilers.
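
    A minimal software sketch of the speculative property-address computation that the third mechanism targets is given below: hidden-class ("shape") lookup with a per-site inline cache. All names are illustrative; the thesis proposal is a HW/SW co-design rather than this pure-software scheme.

        # Minimal sketch of hidden-class property lookup with an inline
        # cache, the kind of property-address computation being accelerated.

        class Shape:
            def __init__(self, offsets):
                self.offsets = offsets            # property name -> slot index

        class JSObject:
            def __init__(self, shape, slots):
                self.shape = shape
                self.slots = slots

        class InlineCache:
            """Per-site cache: remembers the last shape and its slot offset."""
            def __init__(self, prop):
                self.prop = prop
                self.cached_shape = None
                self.cached_offset = None

            def load(self, obj):
                if obj.shape is self.cached_shape:        # cheap identity check
                    return obj.slots[self.cached_offset]  # fast path
                # Slow path: full lookup, then cache for next time.
                offset = obj.shape.offsets[self.prop]
                self.cached_shape, self.cached_offset = obj.shape, offset
                return obj.slots[offset]

        point_shape = Shape({"x": 0, "y": 1})
        site = InlineCache("x")
        for obj in [JSObject(point_shape, [1.0, 2.0]),
                    JSObject(point_shape, [3.0, 4.0])]:
            print(site.load(obj))                 # second load hits the cache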

    Analysis and optimization of engines for dynamically typed languages

    No full text
    Dynamically typed programming languages have become very popular in recent years. These languages ease the task of the programmer but introduce significant overheads, since assumptions about the types of variables have to be constantly validated at run time. JavaScript is a widely used dynamically typed language. In this paper, we provide a detailed analysis of the two main sources of overhead in JavaScript execution. The first is the runtime overhead of dynamic compilation and housekeeping activities (e.g., garbage collection and compilation). The second comes from the additional checks and guards introduced by the dynamic nature of JavaScript. We then propose three new HW/SW optimizations that reduce this latter type of overhead. We show that these two types of overhead represent 35% and 25%, respectively, of the total execution time on average for a representative workload, and that the proposed optimizations provide a 6% average speedup.

    E-PUR: an energy-efficient processing unit for recurrent neural networks

    No full text
    Recurrent Neural Networks (RNNs) are a key technology for emerging applications such as automatic speech recognition, machine translation, and image description. Long Short-Term Memory (LSTM) networks are the most successful RNN implementation, as they can learn long-term dependencies to achieve high accuracy. Unfortunately, the recurrent nature of LSTM networks significantly constrains the amount of parallelism, so multicore CPUs and many-core GPUs exhibit poor efficiency for RNN inference. In this paper, we present E-PUR, an energy-efficient processing unit tailored to the requirements of LSTM computation. The main goal of E-PUR is to support large recurrent neural networks on low-power mobile devices. E-PUR provides an efficient hardware implementation of LSTM networks that is flexible enough to support diverse applications. One of its main novelties is a technique we call Maximizing Weight Locality (MWL), which improves the temporal locality of the memory accesses that fetch the synaptic weights, greatly reducing memory requirements. Our experimental results show that E-PUR achieves real-time performance for different LSTM networks while reducing energy consumption by orders of magnitude with respect to general-purpose processors and GPUs, and it requires a very small chip area. Compared to a modern mobile SoC, the NVIDIA Tegra X1, E-PUR provides an average energy reduction of 88x.

    Modelling HW/SW Co-Designed Processors

    No full text
    This paper presents DARCO, an extensible platform for modelling HW/SW co-designed processors with different guest and host ISAs. Its Emulation Software Layer (ESL) provides staged compilation, translating and optimizing x86 binaries to run on a PowerPC processor. In addition to the functional models, DARCO provides timing simulators and a powerful debugging toolchain. DARCO achieves a functional emulation speed of 8 million x86 instructions per second.
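
    The sketch below illustrates the staged-compilation idea behind emulation layers such as DARCO's ESL: guest code starts out interpreted, and frequently executed blocks are promoted to translated host code kept in a code cache. The threshold, the fake guest blocks, and the translation step are illustrative stand-ins, not DARCO's actual pipeline.

        # Minimal sketch of staged compilation in a co-designed emulation
        # layer: interpret first, translate hot blocks, reuse from a cache.
        HOT_THRESHOLD = 10          # executions before a block is translated

        def interpret(block):
            """Stage 0: slow, simple execution of one guest basic block."""
            return sum(block)       # stand-in for per-instruction dispatch

        def translate(block):
            """Stage 1: produce a fast host-code version of the block."""
            total = sum(block)      # "optimization" done once, at translate time
            return lambda: total    # stand-in for emitted host machine code

        code_cache = {}             # block address -> translated host code
        exec_counts = {}

        def run_block(addr, block):
            if addr in code_cache:                 # hot path: translated code
                return code_cache[addr]()
            exec_counts[addr] = exec_counts.get(addr, 0) + 1
            if exec_counts[addr] >= HOT_THRESHOLD: # promote to next stage
                code_cache[addr] = translate(block)
            return interpret(block)

        guest_block = [1, 2, 3]                    # fake guest instructions
        for _ in range(1000):
            run_block(0x400000, guest_block)       # first runs interpret,
                                                   # the rest hit the code cache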